Automated Annotation Workflow

This workflow uses the auto_annot tools from besca to newly annotate a scRNAseq dataset based on one or more preannotated datasets. Ideally, these datasets come from a similar tissue and condition.

We use supervised machine learning methods to annotate each individual cell utilizing methods like support vector machines (SVM) or logistic regression.

First, the training dataset(s) and the testing dataset are loaded from h5ad files or made available as adata objects. Next, the training and testing datasets are corrected using scanorama, and the training datasets are merged into one anndata object. Then, the classifier is trained on the merged training data. Finally, the classifier is applied to the testing dataset to predict the cell types. If the testing dataset is already annotated (to test the algorithm), a report including confusion matrices can be generated.

In [1]:
import besca as bc
In [2]:
import scanpy as sc
import pkg_resources

test load datasets with scvelo

Apparently the scv loader makes sure the adata objects are all in comparable format whereas the sc loader loads them as is.

In [3]:
adata_test = bc.datasets.Kotliarov2020_processed()
In [4]:
adata_test_orig = bc.datasets.Kotliarov2020_processed()
In [5]:
adata_train1 = bc.datasets.Granja2019_processed()

Concatenation does not lead to errors when the scv loader is used.

In [6]:
adata_train_list = [adata_train1]

Parameter specification

Give your analysis a name.

In [7]:
analysis_name = 'auto_annot_pubimage_trainGtestK' # The analysis name will be used to name the output files

Specify column name of celltype annotation you want to train on.

In [8]:
celltype ='dblabel' # This needs to be a column in the .obs of the training datasets (and test dataset if you want to generate a report)

Choose a method:

  • linear: Support Vector Machine with Linear Kernel
  • sgd: Support Vector Machine with Linear Kernel using Stochastic Gradient Descent
  • rbf: Support Vector Machine with radial basis function kernel. Very time intensive, use only on small datasets.
  • logistic_regression: Standard logistic classifier with multinomial loss.
  • logistic_regression_ovr: Logistic Regression with one versus rest classification.
  • logistic_regression_elastic: Logistic Regression with elastic loss, cross validates among multiple l1 ratios.
In [9]:
method = 'logistic_regression'

Specify merge method. Needs to be either scanorama or naive.

In [10]:
merge = 'scanorama' # We recommend to use scanorama here

Decide if you want to use the raw format or highly variable genes. Raw increases computational time and does not necessarily improve predictions.

In [11]:
use_raw = False # We recommend to use False here

You can choose to only consider a subset of genes from a signature set or use all genes.

In [12]:
genes_to_use = 'all' # We suggest to use all here, but the runtime is strongly improved if you select an appropriate gene set

Column names need to be standardised so the function knows which columns to compare.

In [13]:
adata_train_list[0].obs["dblabel"] = adata_train_list[0].obs.celltype3
adata_test.obs["dblabel"] = adata_test.obs.celltype3
adata_test_orig.obs["dblabel"] = adata_test_orig.obs.celltype3
In [14]:
adata_test.obs.dblabel.unique()
Out[14]:
[cytotoxic CD56-dim natural killer cell, naive thymus-derived CD8-positive, alpha-beta ..., naive thymus-derived CD4-positive, alpha-beta ..., classical monocyte, CD8-positive, alpha-beta cytotoxic T cell, ..., regulatory T cell, CD1c-positive myeloid dendritic cell, plasmacytoid dendritic cell, erythrocyte, plasma cell]
Length: 14
Categories (14, object): [cytotoxic CD56-dim natural killer cell, naive thymus-derived CD8-positive, alpha-beta ..., naive thymus-derived CD4-positive, alpha-beta ..., classical monocyte, ..., CD1c-positive myeloid dendritic cell, plasmacytoid dendritic cell, erythrocyte, plasma cell]
In [15]:
adata_train_list[0].obs.dblabel.unique()
Out[15]:
[naive thymus-derived CD4-positive, alpha-beta ..., classical monocyte, naive B cell, lymphocyte of B lineage, naive thymus-derived CD8-positive, alpha-beta ..., ..., IL7R-max CD8-positive, alpha-beta cytotoxic T ..., hematopoietic multipotent progenitor cell, myeloid leukocyte, basophil, plasma cell]
Length: 25
Categories (25, object): [naive thymus-derived CD4-positive, alpha-beta ..., classical monocyte, naive B cell, lymphocyte of B lineage, ..., hematopoietic multipotent progenitor cell, myeloid leukocyte, basophil, plasma cell]
In [16]:
adata_test.var.dtypes
Out[16]:
ENSEMBL         category
SYMBOL            object
feature_type    category
n_cells          float64
total_counts     float32
frac_reads       float32
dtype: object
In [17]:
adata_train_list[0].var.dtypes
Out[17]:
ENSEMBL           object
SYMBOL            object
feature_type    category
n_cells            int64
total_counts     float32
frac_reads       float32
dtype: object

Correct datasets (e.g. using scanorama) and merge training datasets

This function merges the training datasets, removes unwanted genes, and, if scanorama is used, corrects for batch effects between the datasets.
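As a rough illustration of the gene-matching part of this step (not besca's actual implementation), the sketch below keeps only the genes shared by every dataset and optionally restricts them to a user-supplied gene set. The helper name `shared_genes` and the toy gene lists are hypothetical.

```python
def shared_genes(gene_lists, genes_to_use="all"):
    """Return the sorted genes present in every list (hypothetical helper)."""
    common = set(gene_lists[0])
    for genes in gene_lists[1:]:
        common &= set(genes)
    # Optionally restrict to a signature gene set, mirroring genes_to_use.
    if genes_to_use != "all":
        common &= set(genes_to_use)
    return sorted(common)


# Toy gene lists standing in for the var_names of a training and a test set.
train_genes = ["CD4", "CD8A", "IL7R", "MS4A1"]
test_genes = ["CD8A", "IL7R", "MS4A1", "NKG7"]

all_shared = shared_genes([train_genes, test_genes])
subset_shared = shared_genes([train_genes, test_genes], genes_to_use=["IL7R", "NKG7"])
print(all_shared)
print(subset_shared)
```

Restricting to a smaller gene set shrinks the feature matrix, which is why it can shorten the runtime of the later training step.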

In [18]:
adata_train, adata_test_corrected = bc.tl.auto_annot.merge_data(adata_train_list, adata_test, genes_to_use = genes_to_use, merge = merge)
merging with scanorama
using scanorama rn
Found 640 genes among all datasets
[[0.         0.62571453]
 [0.         0.        ]]
Processing datasets (0, 1)
integrating training set
calculating intersection

Train the classifier

The returned scaler is fitted on the training dataset (centering to zero mean and scaling to unit variance). The same scaling is then applied to the counts in the testing dataset, and the classifier is applied to the scaled testing dataset (see next step, adata_predict()). This function will run multiple jobs in parallel if logistic regression was specified as the method.
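The point of fitting the scaler on the training data only can be seen in a small NumPy sketch. This is an illustration under assumptions, not the code used by besca: the helpers `fit_scaler` and `apply_scaler` and the toy matrices are hypothetical.

```python
import numpy as np


def fit_scaler(X_train):
    """Learn per-gene mean and standard deviation from the training matrix."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0] = 1.0  # constant genes: avoid division by zero
    return mean, std


def apply_scaler(X, mean, std):
    """Scale any matrix with statistics learned on the training data."""
    return (X - mean) / std


# Toy matrices (cells x genes); the second gene is constant in training.
X_train = np.array([[1.0, 10.0], [3.0, 10.0], [5.0, 10.0]])
X_test = np.array([[3.0, 12.0]])

mean, std = fit_scaler(X_train)
X_train_scaled = apply_scaler(X_train, mean, std)
X_test_scaled = apply_scaler(X_test, mean, std)  # reuses the training statistics
print(X_test_scaled)
```

Reusing the training mean and standard deviation keeps the test features on the scale the classifier was trained on; refitting on the test set would shift them.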

In [19]:
classifier, scaler = bc.tl.auto_annot.fit(adata_train, method, celltype, njobs=10)
[Parallel(n_jobs=10)]: Using backend LokyBackend with 10 concurrent workers.
[Parallel(n_jobs=10)]: Done   5 out of   5 | elapsed:  3.1min finished

Prediction

If in addition to the most likely class you would like to have all class probabilities returned use the following function. (This is only a sensible choice if using logistic regression.)
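For intuition on where such class probabilities come from: a multinomial logistic regression turns one linear score per class into probabilities with the softmax function. The sketch below uses made-up scores and class names and is not tied to the classifier object trained above.

```python
import numpy as np


def softmax(scores):
    """Convert per-class scores (cells x classes) into row-wise probabilities."""
    shifted = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


# Hypothetical linear scores for two cells and three classes.
scores = np.array([[2.0, 0.5, -1.0],
                   [0.1, 0.2, 0.0]])
class_names = np.array(["B cell", "T cell", "monocyte"])

probabilities = softmax(scores)
most_likely = class_names[probabilities.argmax(axis=1)]
print(probabilities.round(3))
print(most_likely)
```

The second cell shows why the full probability vector can be informative: its most likely class wins only by a narrow margin.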

In [20]:
adata_predicted = bc.tl.auto_annot.adata_pred_prob(classifier = classifier, scaler = scaler, adata_pred = adata_test_corrected, adata_orig = adata_test_orig, threshold = 0.0)

Output

The adata object that includes the predicted cell type annotation can be written out as an h5ad file.

In [21]:
adata_predicted.write('./adata_predicted_trainGtestK.h5ad')
... storing 'auto_annot' as categorical

If the testing dataset already includes a cell type annotation, a report can be generated and written out, which includes metrics, confusion matrices and comparative UMAP plots.
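Because the training and testing datasets need not share the same label set, the confusion matrix can be rectangular: rows are the labels found in the original annotation, columns the labels that were predicted. A minimal NumPy sketch with toy labels (the helper `confusion_counts` is hypothetical and is not the function used by the report):

```python
import numpy as np


def confusion_counts(y_true, y_pred):
    """Count (true, predicted) label pairs; rows = true labels, columns = predicted labels."""
    true_names = np.unique(y_true)
    pred_names = np.unique(y_pred)
    counts = np.zeros((len(true_names), len(pred_names)), dtype=int)
    for t, p in zip(y_true, y_pred):
        i = np.searchsorted(true_names, t)
        j = np.searchsorted(pred_names, p)
        counts[i, j] += 1
    return counts, true_names, pred_names


# Toy labels: "naive B" exists only among the predictions.
y_true = np.array(["B", "B", "T", "T", "T"])
y_pred = np.array(["B", "naive B", "T", "T", "B"])

counts, true_names, pred_names = confusion_counts(y_true, y_pred)
# Row normalisation: each row becomes the fraction of a true class per predicted label.
row_normalised = counts / counts.sum(axis=1, keepdims=True)
print(pred_names)
print(counts)
print(row_normalised.round(2))
```

Every row corresponds to a label that occurs at least once in `y_true`, so the row sums used for normalisation are never zero.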

In [22]:
adata_predicted.obs
Out[22]:
CELL CONDITION sample_type donor tenx_lane cohort batch sampleid timepoint percent_mito ... myeloid leukocyte naive B cell naive thymus-derived CD4-positive, alpha-beta T cell naive thymus-derived CD8-positive, alpha-beta T cell neutrophil non-classical monocyte plasma cell plasmacytoid dendritic cell pro-B cell regulatory T cell
10X_CiteSeq_donor256.AAACCTGAGAGCCCAA-1 10X_CiteSeq_donor256.AAACCTGAGAGCCCAA-1 PBMC_healthy PBMC donor256 H1B1ln1 H1N1 1 256 d0 0.041481 ... 0.000029 1.017329e-03 1.135649e-05 0.000007 0.000128 1.147308e-04 2.907648e-05 7.386506e-05 3.212233e-06 5.973350e-06
10X_CiteSeq_donor273.AAACCTGAGGCGTACA-1 10X_CiteSeq_donor273.AAACCTGAGGCGTACA-1 PBMC_healthy PBMC donor273 H1B1ln1 H1N1 1 273 d0 0.049020 ... 0.000017 1.308397e-07 9.010606e-01 0.060718 0.000132 8.882910e-09 1.411388e-06 1.189191e-06 2.879767e-06 2.029840e-03
10X_CiteSeq_donor256.AAACCTGCAGGTGGAT-1 10X_CiteSeq_donor256.AAACCTGCAGGTGGAT-1 PBMC_healthy PBMC donor256 H1B1ln1 H1N1 1 256 d0 0.017332 ... 0.002209 1.327423e-04 2.674656e-01 0.021280 0.046810 1.717363e-04 1.558447e-04 6.355526e-05 2.098383e-03 2.789821e-02
10X_CiteSeq_donor200.AAACCTGCAGTATCTG-1 10X_CiteSeq_donor200.AAACCTGCAGTATCTG-1 PBMC_healthy PBMC donor200 H1B1ln1 H1N1 1 200 d0 0.017222 ... 0.030953 1.297322e-03 6.006677e-03 0.003211 0.257943 2.700862e-03 3.514198e-03 8.462183e-04 4.006820e-04 4.314447e-03
10X_CiteSeq_donor233.AAACCTGCATCACAAC-1 10X_CiteSeq_donor233.AAACCTGCATCACAAC-1 PBMC_healthy PBMC donor233 H1B1ln1 H1N1 1 233 d0 0.033969 ... 0.000077 1.779970e-06 2.680332e-05 0.000004 0.013941 4.247383e-04 1.429462e-06 2.449836e-07 4.636980e-07 6.568480e-06
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
10X_CiteSeq_donor205.TTTGTCAGTACGACCC-1 10X_CiteSeq_donor205.TTTGTCAGTACGACCC-1 PBMC_healthy PBMC donor205 H1B2ln6 H1N1 2 205 d0 0.034225 ... 0.000057 1.580758e-05 2.416210e-04 0.000060 0.000062 1.168287e-05 6.090665e-06 6.045835e-05 3.355470e-06 5.801216e-07
10X_CiteSeq_donor205.TTTGTCAGTCAAACTC-1 10X_CiteSeq_donor205.TTTGTCAGTCAAACTC-1 PBMC_healthy PBMC donor205 H1B2ln6 H1N1 2 205 d0 0.029844 ... 0.000874 1.908833e-04 8.544186e-03 0.000868 0.005051 4.756444e-04 6.964237e-05 2.571614e-04 6.210471e-06 7.373240e-05
10X_CiteSeq_donor268.TTTGTCATCCCATTTA-1 10X_CiteSeq_donor268.TTTGTCATCCCATTTA-1 PBMC_healthy PBMC donor268 H1B2ln6 H1N1 2 268 d0 0.014418 ... 0.000014 1.362234e-04 3.287859e-06 0.000001 0.000748 2.168789e-04 3.821204e-05 2.064314e-05 2.515979e-06 1.503870e-04
10X_CiteSeq_donor234.TTTGTCATCGAGAACG-1 10X_CiteSeq_donor234.TTTGTCATCGAGAACG-1 PBMC_healthy PBMC donor234 H1B2ln6 H1N1 2 234 d0 0.032220 ... 0.000018 1.300415e-05 4.287252e-07 0.000002 0.001983 3.856882e-03 6.350399e-07 1.109220e-07 3.629984e-08 3.480198e-06
10X_CiteSeq_donor205.TTTGTCATCTACCTGC-1 10X_CiteSeq_donor205.TTTGTCATCTACCTGC-1 PBMC_healthy PBMC donor205 H1B2ln6 H1N1 2 205 d0 0.019579 ... 0.048401 5.409277e-03 1.419858e-01 0.000894 0.006642 1.749978e-02 1.265990e-03 3.208276e-04 7.637871e-05 2.750983e-02

47511 rows × 47 columns

In [23]:
adata_predicted = bc.st.clustering(adata_predicted, '.')
leiden clustering performed with a resolution of 1
WARNING: saving figure to file figures/umap.leiden.png
rank genes per cluster calculated using method wilcoxon.
mapping of cells to  leiden exported successfully to cell2labels.tsv
average.gct exported successfully to file
fract_pos.gct exported successfully to file
labelinfo.tsv successfully written out
./labelings/leiden/WilxRank.gct written out
./labelings/leiden/WilxRank.pvalues.gct written out
./labelings/leiden/WilxRank.logFC.gct written out
In [24]:
%matplotlib inline
sc.settings.set_figure_params(dpi=90)
bc.tl.auto_annot.report(adata_predicted, celltype, method, analysis_name, False, merge, use_raw, genes_to_use, clustering = 'leiden')
WARNING: saving figure to file figures/umap.ondata_auto_annot_pubimage_trainGtestK.png
WARNING: saving figure to file figures/umap.auto_annot_pubimage_trainGtestK.png
Confusion matrix, without normalization
Normalized confusion matrix
In [25]:
sc.settings.set_figure_params(dpi=240)

sc.pl.umap(adata_predicted, color=[celltype, 'auto_annot', 'leiden'], legend_loc='on data',legend_fontsize=7,  save= '.fig3_supp_trainGtestKondata.svg')
sc.pl.umap(adata_predicted, color=[celltype, 'auto_annot', 'leiden'],legend_fontsize=7, wspace = 1.4, save = '.fig3_supp_trainGtestK.svg')
WARNING: saving figure to file figures/umap.fig3_supp_trainGtestKondata.svg
WARNING: saving figure to file figures/umap.fig3_supp_trainGtestK.svg
In [26]:
import matplotlib.pyplot as plt
import matplotlib
import numpy as np
from sklearn.metrics import confusion_matrix
def plot_confusion_matrix(y_true, y_pred, classes, celltype,
                          normalize=False,
                          title=None, numbers =False,
                          cmap=plt.cm.Blues, adata_predicted= None, asymmetric_matrix = True): 

    matplotlib.use('Agg')
    
    if not title:
        if normalize:
            title = 'Normalized confusion matrix'
        else:
            title = 'Confusion matrix, without normalization'

    # Compute confusion matrix
    cm = confusion_matrix(y_true, y_pred)
    # Only use the labels that appear in the data
    #classes = classes[unique_labels(y_true, y_pred)]
    if asymmetric_matrix == True:
        class_names =  np.unique(np.concatenate((adata_predicted.obs[celltype], adata_predicted.obs['auto_annot'])))
        class_names_orig = np.unique(adata_predicted.obs[celltype])
        class_names_pred = np.unique(adata_predicted.obs['auto_annot'])
        test_celltypes_ind = np.searchsorted(class_names, class_names_orig)
        train_celltypes_ind = np.searchsorted(class_names, class_names_pred)
        cm=cm[test_celltypes_ind,:][:,train_celltypes_ind]
    
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    fig, ax = plt.subplots(figsize=(15,15))
    im = ax.imshow(cm, interpolation='nearest', cmap=cmap)
    ax.figure.colorbar(im, ax=ax, shrink = 0.8)
    # We want to show all ticks...
    if asymmetric_matrix == True:
        ax.set(xticks=np.arange(cm.shape[1]),
               yticks=np.arange(cm.shape[0]),
               # ... and label them with the respective list entries
               xticklabels=class_names_pred, yticklabels=class_names_orig,
               title=title,
               ylabel='True label',
               xlabel='Predicted label')
    else:
        ax.set(xticks=np.arange(cm.shape[1]),
               yticks=np.arange(cm.shape[0]),
               # ... and label them with the respective list entries
               xticklabels=classes, yticklabels=classes,
               title=title,
               ylabel='True label',
               xlabel='Predicted label')
        
    ax.grid(False)
    #ax.tick_params(axis='both', which='major', labelsize=10)
    # Rotate the tick labels and set their alignment.
    plt.setp(ax.get_xticklabels(), rotation=45, ha="right",
             rotation_mode="anchor")

    # Loop over data dimensions and create text annotations.
    if numbers == True:
        fmt = '.2f' if normalize else 'd'
        thresh = cm.max() / 2.
        for i in range(cm.shape[0]):
            for j in range(cm.shape[1]):
                ax.text(j, i, format(cm[i, j], fmt),
                        ha="center", va="center",
                        color="white" if cm[i, j] > thresh else "black")
    #fig.tight_layout()
    return ax
In [27]:
import os
In [28]:
# make conf matrices (4)
class_names =  np.unique(np.concatenate((adata_predicted.obs[celltype], adata_predicted.obs['auto_annot'])))
np.set_printoptions(precision=2)
# Plot non-normalized confusion matrix
plot_confusion_matrix(adata_predicted.obs[celltype], adata_predicted.obs['auto_annot'], title = " ", classes=class_names, celltype=celltype ,numbers = False, adata_predicted = adata_predicted, asymmetric_matrix = True)
plt.savefig(os.path.join('fig3_supp_trainGtestK_confusion_matrix_nonnormalised.svg'))

# Plot normalized confusion matrix (set numbers = True to print the values in each cell)
plot_confusion_matrix(adata_predicted.obs[celltype], adata_predicted.obs['auto_annot'], title = " ", classes=class_names,celltype=celltype,  normalize=True, numbers = False, adata_predicted = adata_predicted, asymmetric_matrix = True)
plt.savefig(os.path.join('fig3_supp_trainGtestK_confusion_matrix_normalised.svg'))
Confusion matrix, without normalization
Normalized confusion matrix
In [29]:
bc.pl.riverplot_2categories(adata_predicted, [celltype, 'auto_annot'])

Let's use a threshold
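The sketch below illustrates one plausible reading of the threshold parameter, stated here as an assumption rather than as besca's documented behaviour: a cell keeps its most likely label only if that label's probability reaches the threshold, and is labelled 'unknown' otherwise. The helper `label_with_threshold` and the toy probabilities are hypothetical.

```python
import numpy as np


def label_with_threshold(probabilities, class_names, threshold):
    """Assign the most likely class, or 'unknown' if its probability is below threshold."""
    best = probabilities.argmax(axis=1)
    labels = class_names[best].astype(object)
    labels[probabilities.max(axis=1) < threshold] = "unknown"
    return labels


class_names = np.array(["naive B cell", "plasma cell", "regulatory T cell"])
# Toy class probabilities for three cells (each row sums to 1).
probabilities = np.array([[0.90, 0.06, 0.04],
                          [0.45, 0.30, 0.25],
                          [0.10, 0.15, 0.75]])

labels_all = label_with_threshold(probabilities, class_names, threshold=0.0)
labels_strict = label_with_threshold(probabilities, class_names, threshold=0.7)
print(list(labels_all))
print(list(labels_strict))
```

With a threshold of 0.0 every cell receives a label; raising it trades coverage for confidence, since ambiguous cells drop out as 'unknown'.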

In [30]:
analysis_name = 'auto_annot_pubimage_threshold_trainGtestK' # The analysis name will be used to name the output files
In [31]:
adata_predicted_threshold = bc.tl.auto_annot.adata_pred_prob(classifier = classifier, scaler = scaler, adata_pred = adata_test_corrected, adata_orig = adata_test_orig, threshold = 0.7)
In [32]:
adata_predicted_threshold.write('./adata_predicted_threshold_trainGtestK.h5ad')
... storing 'auto_annot' as categorical
In [33]:
%matplotlib inline
sc.settings.set_figure_params(dpi=90)
bc.tl.auto_annot.report(adata_predicted_threshold, celltype, method, analysis_name, False, merge, use_raw, genes_to_use, clustering = 'leiden')
WARNING: saving figure to file figures/umap.ondata_auto_annot_pubimage_threshold_trainGtestK.png
WARNING: saving figure to file figures/umap.auto_annot_pubimage_threshold_trainGtestK.png
Confusion matrix, without normalization
Normalized confusion matrix
In [34]:
sc.settings.set_figure_params(dpi=240)

sc.pl.umap(adata_predicted_threshold, color=[celltype, 'auto_annot', 'leiden'], legend_loc='on data',legend_fontsize=7,  save= '.fig3_supp_trainGtestK_threshold_ondata.svg')
sc.pl.umap(adata_predicted_threshold, color=[celltype, 'auto_annot', 'leiden'],legend_fontsize=7, wspace = 1.4, save = '.fig3_supp_trainGtestK_threshold.svg')
WARNING: saving figure to file figures/umap.fig3_supp_trainGtestK_threshold_ondata.svg
WARNING: saving figure to file figures/umap.fig3_supp_trainGtestK_threshold.svg
In [35]:
# make conf matrices (4)
class_names =  np.unique(np.concatenate((adata_predicted_threshold.obs[celltype], adata_predicted_threshold.obs['auto_annot'])))
np.set_printoptions(precision=2)
# Plot non-normalized confusion matrix
plot_confusion_matrix(adata_predicted_threshold.obs[celltype], adata_predicted_threshold.obs['auto_annot'], title = " ", classes=class_names, celltype=celltype ,numbers = False, adata_predicted = adata_predicted_threshold, asymmetric_matrix = True)
plt.savefig(os.path.join('fig3_supp_trainG_testK_confusion_matrix_threshold_nonnormalised.svg'))

# Plot normalized confusion matrix (set numbers = True to print the values in each cell)
plot_confusion_matrix(adata_predicted_threshold.obs[celltype], adata_predicted_threshold.obs['auto_annot'], title = " ", classes=class_names,celltype=celltype,  normalize=True, numbers = False, adata_predicted = adata_predicted_threshold, asymmetric_matrix = True)
plt.savefig(os.path.join('fig3_supp_trainG_testK_confusion_matrix_threshold_normalised.svg'))
Confusion matrix, without normalization
Normalized confusion matrix
In [36]:
bc.pl.riverplot_2categories(adata_predicted_threshold, [celltype, 'auto_annot'])
In [37]:
adata_predicted_wo_unknown = adata_predicted_threshold.copy()
adata_predicted_wo_unknown = bc.subset_adata(adata_predicted_wo_unknown, adata_predicted_wo_unknown.obs.auto_annot != 'unknown', raw=False)
bc.pl.riverplot_2categories(adata_predicted_wo_unknown, [celltype, 'auto_annot'])

Let's check if the differences in annotation make sense

In [38]:
gmt_file_IMM=pkg_resources.resource_filename('besca', 'datasets/genesets/HumanCD45p_scseqCMs6.gmt')
bc.tl.sig.combined_signature_score(adata_predicted, gmt_file_IMM)
WARNING: genes are not in var_names and ignored: ['FCRL4']
WARNING: genes are not in var_names and ignored: ['FCGR3']
WARNING: genes are not in var_names and ignored: ['ENPP3']
WARNING: genes are not in var_names and ignored: ['CASP8AP2', 'DSSC1', 'E2F8', 'EXO1']
WARNING: genes are not in var_names and ignored: ['ANLN', 'CSK2', 'NEK2']
WARNING: genes are not in var_names and ignored: ['FAP', 'THY1', 'DCN', 'COL1A1', 'COL1A2', 'CXCL14', 'LUM', 'COL3A1', 'DPT', 'ISLR', 'PODN', 'FDF7', 'PDGFRL']
WARNING: genes are not in var_names and ignored: ['TNFA', 'IL4', 'IL7A', 'IL12', 'IL13', 'IL21', 'IL22', 'IL23', 'CXCL5', 'CXCL9', 'CXCL11', 'CXCL12', 'CXCL13', 'CX3CL1', 'GM-CSF', 'GCSFCCL1', 'CCL7', 'CCL11', 'CCL12', 'CCL13', 'CCL17', 'CCL19', 'CCL22', 'CCL25', 'CCL24', 'CCL26', 'SDF1A', 'BCA1', 'MIP1B']
WARNING: genes are not in var_names and ignored: ['LY6C1', 'SIGLECH']
WARNING: genes are not in var_names and ignored: ['CDH5', 'ITCAM1', 'ITGB3', 'KDR', 'MCAM', 'PECAM1', 'SELE', 'TEK', 'VCAM1', 'VWF']
WARNING: genes are not in var_names and ignored: ['CDH5', 'ITGB3', 'KDR', 'MCAM', 'PECAM1', 'SELE', 'TEK', 'VCAM1', 'VWF']
WARNING: genes are not in var_names and ignored: ['PECAM1', 'VWF', 'CDH5', 'ECSCR', 'CCL14', 'SLCO2A1', 'KDR', 'ERG', 'FABP4']
WARNING: genes are not in var_names and ignored: ['IL9R', 'SLIGLEC10', 'SIGLEC8']
WARNING: genes are not in var_names and ignored: ['EPCAM', 'KRT19']
WARNING: genes are not in var_names and ignored: ['TILPL2']
WARNING: genes are not in var_names and ignored: ['HLA-H', 'HLA-L', 'HLA-DRB2']
WARNING: genes are not in var_names and ignored: ['CDH1']
WARNING: genes are not in var_names and ignored: ['OAS1G']
WARNING: genes are not in var_names and ignored: ['CXCL9']
WARNING: genes are not in var_names and ignored: ['ADGRE1', 'APOE']
WARNING: genes are not in var_names and ignored: ['ENPP3']
WARNING: genes are not in var_names and ignored: ['ITGB3', 'PECAM1']
WARNING: genes are not in var_names and ignored: ['MIA', 'TYR', 'SLC45A2', 'CDH19', 'PMEL', 'SLC24A5', 'MAGEA6', 'GJB1', 'PLP1', 'PRAME', 'PAX3', 'S100A1', 'MLANA']
WARNING: genes are not in var_names and ignored: ['SLIT2', 'BGN', 'TNC', 'CYR6', 'GFRA3', 'SLITRK6', 'AQP1']
WARNING: genes are not in var_names and ignored: ['IGHG1', 'IGHG2', 'IGHA1']
WARNING: genes are not in var_names and ignored: ['FCGR3']
WARNING: genes are not in var_names and ignored: ['FCGR3']
WARNING: genes are not in var_names and ignored: ['FCGR3', 'FCGR1']
WARNING: genes are not in var_names and ignored: ['FCGR4', 'FCGR1']
WARNING: genes are not in var_names and ignored: ['LY6G', 'CD177']
WARNING: genes are not in var_names and ignored: ['TRDC']
WARNING: genes are not in var_names and ignored: ['IGHD', 'IGHM']
WARNING: genes are not in var_names and ignored: ['CEACAM8']
WARNING: genes are not in var_names and ignored: ['IGF1', 'ITGA8']
WARNING: genes are not in var_names and ignored: ['CDH2', 'LRP5', 'SMO', 'SOX9']
WARNING: genes are not in var_names and ignored: ['SOX9']
WARNING: genes are not in var_names and ignored: ['MMP1', 'MMP2', 'PDGFRA', 'PECAM1', 'THY1', 'VCAM1']
WARNING: genes are not in var_names and ignored: ['TRADO']
WARNING: genes are not in var_names and ignored: ['APOE', 'CXCL12', 'CD209']
WARNING: genes are not in var_names and ignored: ['CXCL11', 'CXCL9']
WARNING: genes are not in var_names and ignored: ['ANGTPL4', 'CXCL5', 'PPARG']
WARNING: genes are not in var_names and ignored: ['CSF2', 'SPP4', 'IFNA1', 'TNFSF11']
WARNING: genes are not in var_names and ignored: ['IL17A', 'IL21', 'IL22']
WARNING: genes are not in var_names and ignored: ['CCR8', 'CSF2', 'CSCR4', 'HAVCR1', 'IL13', 'IL4', 'IL5']
WARNING: genes are not in var_names and ignored: ['TRAC']
WARNING: genes are not in var_names and ignored: ['TRGC1', 'TRDC', 'TRDV2', 'TRDV1']
WARNING: provided gene list has length 0, scores as 0
WARNING: genes are not in var_names and ignored: ['TRGC2']
WARNING: genes are not in var_names and ignored: ['CXCL9']
WARNING: genes are not in var_names and ignored: ['FLAMF1']
WARNING: genes are not in var_names and ignored: ['LRRC32']
WARNING: genes are not in var_names and ignored: ['CCXR3']
WARNING: genes are not in var_names and ignored: ['CCL22', 'CCL17', 'CCL19']
WARNING: genes are not in var_names and ignored: ['XCR1']
WARNING: genes are not in var_names and ignored: ['PLET1', 'XCR1']
WARNING: genes are not in var_names and ignored: ['EPCAM', 'SIGLECG', 'PLET1', 'PPP1R1A']
WARNING: genes are not in var_names and ignored: ['ARG1']
WARNING: genes are not in var_names and ignored: ['SIGLECH']
WARNING: genes are not in var_names and ignored: ['C7', 'SIGLECG']
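The warnings above indicate that signature genes missing from var_names are ignored and that an empty gene list is scored as 0. The following simplified sketch reproduces only that behaviour by averaging the expression of the genes that are present. It is not the scoring used in the notebook: scanpy's gene scoring, for instance, additionally subtracts the average of a reference gene set. The helper `mean_signature_score` and the toy data are hypothetical.

```python
import numpy as np


def mean_signature_score(X, var_names, signature):
    """Average expression of the signature genes found in var_names (simplified)."""
    present = [g for g in signature if g in var_names]
    missing = [g for g in signature if g not in var_names]
    if missing:
        print(f"genes are not in var_names and ignored: {missing}")
    if not present:
        # No usable genes left: every cell is scored as 0.
        return np.zeros(X.shape[0])
    cols = [var_names.index(g) for g in present]
    return X[:, cols].mean(axis=1)


# Toy expression matrix (cells x genes).
var_names = ["CD4", "CD8A", "IL7R"]
X = np.array([[1.0, 0.0, 3.0],
              [0.0, 2.0, 2.0]])

partial_score = mean_signature_score(X, var_names, ["CD4", "IL7R", "FCRL4"])
empty_score = mean_signature_score(X, var_names, ["TRGC2"])
print(partial_score)
print(empty_score)
```

Signatures that lose most of their genes in this way rest on very few measurements, so the corresponding scores should be interpreted with caution.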
In [39]:
adata_predicted.var_names
Out[39]:
Index(['HES4', 'ISG15', 'TNFRSF18', 'TNFRSF4', 'MIB2', 'MMP23B', 'PLCH2',
       'C1orf174', 'UTS2', 'GPR157',
       ...
       'CD127_PROT', 'CD194_PROT', 'CD274_PROT', 'CD33_PROT', 'CD7_PROT',
       'CD80_PROT', 'CD86_PROT', 'CD183_PROT', 'CD34_PROT', 'CD20_PROT'],
      dtype='object', length=1271)
In [40]:
scores = [x for x in adata_predicted.obs.columns if 'CD45' in x]
scores
Out[40]:
['score_HumanCD45p_scseqCMs6_ActB_scanpy',
 'score_HumanCD45p_scseqCMs6_Activation_scanpy',
 'score_HumanCD45p_scseqCMs6_Basophil_scanpy',
 'score_HumanCD45p_scseqCMs6_Bcells_scanpy',
 'score_HumanCD45p_scseqCMs6_CCG1S_scanpy',
 'score_HumanCD45p_scseqCMs6_CCG2M_scanpy',
 'score_HumanCD45p_scseqCMs6_Cafs_scanpy',
 'score_HumanCD45p_scseqCMs6_Cellcycle_scanpy',
 'score_HumanCD45p_scseqCMs6_Checkpoint_scanpy',
 'score_HumanCD45p_scseqCMs6_Cyto_scanpy',
 'score_HumanCD45p_scseqCMs6_Cytotox_scanpy',
 'score_HumanCD45p_scseqCMs6_DCR_scanpy',
 'score_HumanCD45p_scseqCMs6_DCrec_scanpy',
 'score_HumanCD45p_scseqCMs6_DCs_scanpy',
 'score_HumanCD45p_scseqCMs6_Eff_scanpy',
 'score_HumanCD45p_scseqCMs6_Endo_scanpy',
 'score_HumanCD45p_scseqCMs6_Endot_scanpy',
 'score_HumanCD45p_scseqCMs6_Endothelial_scanpy',
 'score_HumanCD45p_scseqCMs6_Eosinophil_scanpy',
 'score_HumanCD45p_scseqCMs6_Epith_scanpy',
 'score_HumanCD45p_scseqCMs6_ExhB_scanpy',
 'score_HumanCD45p_scseqCMs6_Granulo_scanpy',
 'score_HumanCD45p_scseqCMs6_HLA_scanpy',
 'score_HumanCD45p_scseqCMs6_HLAP_scanpy',
 'score_HumanCD45p_scseqCMs6_HLAS_scanpy',
 'score_HumanCD45p_scseqCMs6_Ifi_scanpy',
 'score_HumanCD45p_scseqCMs6_Ifng_scanpy',
 'score_HumanCD45p_scseqCMs6_Macrophage_scanpy',
 'score_HumanCD45p_scseqCMs6_Mast_scanpy',
 'score_HumanCD45p_scseqCMs6_Megakaryocytes_scanpy',
 'score_HumanCD45p_scseqCMs6_MelMelan_scanpy',
 'score_HumanCD45p_scseqCMs6_MelMesen_scanpy',
 'score_HumanCD45p_scseqCMs6_MemB_scanpy',
 'score_HumanCD45p_scseqCMs6_Memory_scanpy',
 'score_HumanCD45p_scseqCMs6_Mo14_scanpy',
 'score_HumanCD45p_scseqCMs6_Mo16_scanpy',
 'score_HumanCD45p_scseqCMs6_MoMa_scanpy',
 'score_HumanCD45p_scseqCMs6_Monocytes_scanpy',
 'score_HumanCD45p_scseqCMs6_Myelo_scanpy',
 'score_HumanCD45p_scseqCMs6_MyeloSubtype_scanpy',
 'score_HumanCD45p_scseqCMs6_NKT_scanpy',
 'score_HumanCD45p_scseqCMs6_NKcells_scanpy',
 'score_HumanCD45p_scseqCMs6_NKcyt_scanpy',
 'score_HumanCD45p_scseqCMs6_NKnai_scanpy',
 'score_HumanCD45p_scseqCMs6_Naive_scanpy',
 'score_HumanCD45p_scseqCMs6_NaiveB_scanpy',
 'score_HumanCD45p_scseqCMs6_Neutrophil_scanpy',
 'score_HumanCD45p_scseqCMs6_NonEff_scanpy',
 'score_HumanCD45p_scseqCMs6_OMyelo_scanpy',
 'score_HumanCD45p_scseqCMs6_Others_scanpy',
 'score_HumanCD45p_scseqCMs6_Plasma_scanpy',
 'score_HumanCD45p_scseqCMs6_Pyro_scanpy',
 'score_HumanCD45p_scseqCMs6_Stemmess_scanpy',
 'score_HumanCD45p_scseqCMs6_StemmessS_scanpy',
 'score_HumanCD45p_scseqCMs6_Stromal_scanpy',
 'score_HumanCD45p_scseqCMs6_T4CM_scanpy',
 'score_HumanCD45p_scseqCMs6_TAM_scanpy',
 'score_HumanCD45p_scseqCMs6_TAMCx_scanpy',
 'score_HumanCD45p_scseqCMs6_TEM_scanpy',
 'score_HumanCD45p_scseqCMs6_TMO_scanpy',
 'score_HumanCD45p_scseqCMs6_TMid_scanpy',
 'score_HumanCD45p_scseqCMs6_TNK_scanpy',
 'score_HumanCD45p_scseqCMs6_TStem_scanpy',
 'score_HumanCD45p_scseqCMs6_TStemhi_scanpy',
 'score_HumanCD45p_scseqCMs6_TSteml_scanpy',
 'score_HumanCD45p_scseqCMs6_TStemlo_scanpy',
 'score_HumanCD45p_scseqCMs6_TTh1_scanpy',
 'score_HumanCD45p_scseqCMs6_TTh17_scanpy',
 'score_HumanCD45p_scseqCMs6_TTh2_scanpy',
 'score_HumanCD45p_scseqCMs6_Tcd4_scanpy',
 'score_HumanCD45p_scseqCMs6_Tcd8_scanpy',
 'score_HumanCD45p_scseqCMs6_Tcells_scanpy',
 'score_HumanCD45p_scseqCMs6_Tcgd_scanpy',
 'score_HumanCD45p_scseqCMs6_Tcytox_scanpy',
 'score_HumanCD45p_scseqCMs6_Teff_scanpy',
 'score_HumanCD45p_scseqCMs6_Tfh_scanpy',
 'score_HumanCD45p_scseqCMs6_TilCM_scanpy',
 'score_HumanCD45p_scseqCMs6_Tpexh_scanpy',
 'score_HumanCD45p_scseqCMs6_Treg_scanpy',
 'score_HumanCD45p_scseqCMs6_Ttexh_scanpy',
 'score_HumanCD45p_scseqCMs6_Ubi_scanpy',
 'score_HumanCD45p_scseqCMs6_UnivExh_scanpy',
 'score_HumanCD45p_scseqCMs6_UnivMem_scanpy',
 'score_HumanCD45p_scseqCMs6_UnivNaive_scanpy',
 'score_HumanCD45p_scseqCMs6_aDCs_scanpy',
 'score_HumanCD45p_scseqCMs6_allSteml_scanpy',
 'score_HumanCD45p_scseqCMs6_cDC1_scanpy',
 'score_HumanCD45p_scseqCMs6_cDC2_scanpy',
 'score_HumanCD45p_scseqCMs6_cDCs_scanpy',
 'score_HumanCD45p_scseqCMs6_epDCs_scanpy',
 'score_HumanCD45p_scseqCMs6_general_scanpy',
 'score_HumanCD45p_scseqCMs6_moDC_scanpy',
 'score_HumanCD45p_scseqCMs6_pDCs_scanpy',
 'score_HumanCD45p_scseqCMs6_uDCs_scanpy']

Indeed it seems like the classification of B cells is an improvement, whereas the varieties of T cells pose difficulties.

In [41]:
sc.pl.umap(adata_predicted, color= ["score_HumanCD45p_scseqCMs6_MemB_scanpy", "score_HumanCD45p_scseqCMs6_NaiveB_scanpy","CD4", "CD8A"], ncols = 2, wspace = 0.4, color_map = 'viridis',save= '.fig3_markers.svg')
WARNING: saving figure to file figures/umap.fig3_markers.svg
In [42]:
sc.pl.umap(adata_predicted, color= ["IL7R"], ncols = 2, wspace = 0.4, color_map = 'viridis')